
We develop an efficient estimation procedure for identifying and estimating the central subspace. Using a new way of parameterization, we convert the problem of identifying the central subspace to the problem of estimating a finite dimensional parameter in a semiparametric model. This conversion allows us to derive an efficient estimator which reaches the optimal semiparametric efficiency bound. The resulting efficient estimator can exhaustively estimate the central subspace without imposing any distributional assumptions. Our proposed efficient estimation also provides a possibility for making inference of parameters that uniquely identify the central subspace. We conduct simulation studies and a real data analysis to demonstrate the finite sample performance in comparison with several existing methods.

1. Introduction. Consider a general model in which the univariate response variable $Y$ is assumed to depend on the $p$-dimensional covariate vector $\mathbf{x}$ only through a small number of linear combinations $\beta^{T}\mathbf{x}$, where $\beta$ is a $p \times d$ matrix with $d < p$. In this model, how $Y$ depends on $\beta^{T}\mathbf{x}$ is left unspecified. It is not difficult to see that $\beta$ is not identifiable. The quantity of general interest is usually the column space of $\beta$, which is termed the central subspace if $d$ is the smallest possible value to satisfy the model assumption [5].

Y. MA AND L. ZHU

This general model was proposed by Li [12] and has attracted much attention in the last two decades. It generated the field of sufficient dimension reduction [5], in which the main interest is to estimate the central subspace consistently. Influential works in this area include, but are not limited to, sliced inverse regression [12], sliced average variance estimation [6], directional regression [10], the generalization of the aforementioned methods to nonelliptically distributed predictors [7, 9], Fourier transformation [30], cumulative slicing estimators [29] and conditional density based minimum average variance estimation [26].

Despite the various estimation methods, it is unclear if any of these es- 
timators are optimal in the sense that they can exhaustively estimate the 
entire central subspace and have the minimum possible asymptotic estima- 
tion variance. To the best of our knowledge, the efficiency issue has never 
been discussed in the context of sufficient dimension reduction. 

In this paper we study the estimation and inference in sufficient dimen- 
sion reduction. We propose a simple parameterization so that the central 
subspace is uniquely identified by a $(p - d)d$-dimensional parameter that is
not subject to any constraints. Thus we convert the problem of identify- 
ing the central subspace into a problem of estimating a finite dimensional 
parameter in a semiparametric model. This allows us to derive the estima- 
tion procedures and perform inference using semiparametric tools. How to 
make inference about the central subspace is a challenging issue. This is 
partially caused by the complexity of estimating a space rather than a pa- 
rameter. Our new parameterization overcomes this complexity and permits 
a relatively straightforward calculation of the estimation variability. 

We further construct an efficient estimator, which reaches the minimum 
asymptotic estimation variance bound among all possible consistent esti- 
mators. Efficiency bounds are of fundamental importance to the theoretical 
consideration. Such bounds quantify the minimum efficiency loss that re- 
sults from generalizing one restrictive model to a more flexible one, and 
hence they can be important in making the decision of which model to use. 
The efficiency bounds also provide a gold standard by which the asymptotic 
efficiency of any particular semiparametric estimator can be measured [22].
Generally speaking, a semiparametric efficient estimator is usually the ulti- 
mate destination when searching for consistent estimators or trying to im- 
prove existing procedures. When an efficient estimator is obtained, the pro- 
cedure of estimation can be considered to have reached certain optimality. 

In the literature, vast and significant effort has been devoted to studying the semiparametric efficiency bounds for consistent estimators in semiparametric models. The simplest and most familiar examples are the ordinary and weighted least squares estimators in the linear regression setting. Efficiency issues are also considered in more complex semiparametric problems such as regressions with missing covariates [23], skewed distribution families [18, 19], measurement error models [15, 25], partially linear models [16], the Cox model ([24], page 113), the accelerated failure model [27] or other general survival models [28] and latent variable models [17].

One typical semiparametric tool is to obtain estimators through obtain- 
ing the corresponding influence functions. In deriving the influence function 
family and its efficient member, we use the geometric technique illustrated 
in [2] and [24]. All our derivations are performed without using the linearity
or constant variance condition that is often assumed in the dimension reduc- 
tion literature. Our analysis is thus readily applicable when some covariates 
are discrete or categorical. In summary, we provide an efficient estimator 
which can exhaustively estimate the central subspace without imposing any 
distributional assumptions on the covariate x. 

The rest of this paper is organized as follows. In Section 2, we propose 
a simple parameterization of the central subspace and highlight the semi- 
parametric approach to estimating the central subspace. We also derive the 
efficient score function. In Section 3, we present a class of locally efficient es- 
timators and identify the efficient member. We illustrate how to implement 
the efficient estimator to reach the optimal efficiency bound. Simulation 
studies are conducted in Section 4 to demonstrate the finite sample perfor- 
mance and the method is implemented in a real data example in Section 5. 
We finish the paper with a brief discussion in Section 6. All the technical 
derivations are given in the supplementary material [21].

2. The semiparametric formulation. 

2.1. Parameterization of central subspace. In the context of sufficient dimension reduction [5, 12], one often assumes

$$F(y \mid \mathbf{x}) = F(y \mid \beta^{T}\mathbf{x}) \quad \text{for } y \in \mathbb{R}, \tag{2.1}$$

where $F(y \mid \mathbf{x}) = \Pr(Y \le y \mid \mathbf{x})$ is the conditional distribution function of the response $Y$ given the covariates $\mathbf{x}$, and $\beta$ is a $p \times d$ matrix as defined previously. The goal of sufficient dimension reduction is to estimate the column space of $\beta$, which is termed the dimension reduction subspace. Because a dimension reduction subspace is not necessarily unique, the primary interest is usually the central subspace $\mathcal{S}_{Y\mid\mathbf{x}}$, which is defined as the minimum dimension reduction subspace if it exists and is unique [5]. The dimension of $\mathcal{S}_{Y\mid\mathbf{x}}$, denoted with $d$, is commonly referred to as the structural dimension. Similarly to [4], we exclude a pathological case where there exists a vector $\mathbf{a}$ such that $\mathbf{a}^{T}\mathbf{x}$ is a deterministic function of $\beta^{T}\mathbf{x}$ while $\mathbf{a}$ does not belong to the column space of $\beta$.

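The central subspace $\mathcal{S}_{Y\mid\mathbf{x}}$ has a well-known invariance property ([5], page 106), that is, $\mathcal{S}_{Y\mid\mathbf{x}} = \mathbf{D}\mathcal{S}_{Y\mid\mathbf{z}}$, where $\mathbf{z} = \mathbf{D}^{T}\mathbf{x} + \mathbf{b}$ for any $p \times p$ nonsingular matrix $\mathbf{D}$ and any length $p$ vector $\mathbf{b}$. This allows us to assume throughout that the covariate vector $\mathbf{x}$ satisfies $E(\mathbf{x}) = \mathbf{0}$ and $\mathrm{cov}(\mathbf{x}) = \mathbf{I}_p$.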




Identifying $\mathcal{S}_{Y\mid\mathbf{x}}$ is the essential interest of sufficient dimension reduction for model (2.1). Typically, $\mathcal{S}_{Y\mid\mathbf{x}}$ is identified through estimating a basis matrix $\beta \in \mathbb{R}^{p \times d}$ of minimal dimension that satisfies (2.1). Although $\mathcal{S}_{Y\mid\mathbf{x}}$ is unique, the basis matrix $\beta$ is clearly not. In fact, for any $d \times d$ full rank matrix $\mathbf{A}$, $\beta\mathbf{A}$ generates the same column space as $\beta$. Thus, to uniquely map one central subspace $\mathcal{S}_{Y\mid\mathbf{x}}$ to one basis matrix, we need to focus on one representative member of all the $\beta\mathbf{A}$ matrices generated by different $\mathbf{A}$'s. We write $\beta = (\beta_u^{T}, \beta_l^{T})^{T}$, where the upper submatrix $\beta_u$ has size $d \times d$ and the lower submatrix $\beta_l$ has size $(p - d) \times d$. Because $\beta$ has rank $d$, we can assume without loss of generality that $\beta_u$ is invertible. The advantage of using $\beta\beta_u^{-1}$ is that its upper $d \times d$ submatrix is the identity matrix, while the lower $(p - d) \times d$ matrix can be any matrix. In addition, two matrices $\beta_1\beta_{1u}^{-1}$ and $\beta_2\beta_{2u}^{-1}$ are different if and only if the column spaces of $\beta_1$ and $\beta_2$ are different. Therefore, if we consider the set of all the $p \times d$ matrices $\beta$ where the upper $d \times d$ submatrix is the identity matrix $\mathbf{I}_d$, it has a one-to-one mapping with the set of all the different central subspaces. Thus, as long as we restrict our attention to the set of all such matrices, the problem of identifying $\mathcal{S}_{Y\mid\mathbf{x}}$ is converted to the problem of estimating $\beta_l$, which contains $p_t = (p - d)d$ free parameters. Note that $p_t$ is the dimension of the Grassmann manifold formed by the column spaces of all different $\beta$ matrices. Thus, we can view $\beta_l$ as a unique parameterization of the manifold. Here the subscript "t" stands for total. For notational convenience in the remainder of the text, for an arbitrary $p \times d$ matrix $\beta = (\beta_u^{T}, \beta_l^{T})^{T}$, we define the concatenation of the columns contained in the lower $p - d$ rows of $\beta$ as $\mathrm{vecl}(\beta) = \mathrm{vec}(\beta_l) = (\beta_{d+1,1}, \ldots, \beta_{p,1}, \ldots, \beta_{d+1,d}, \ldots, \beta_{p,d})^{T}$, where in the notation vecl, "vec" stands for vectorization, and "l" stands for the lower part of the original matrix. We then can write the concatenation of the parameters in $\beta$ as $\mathrm{vecl}(\beta)$. Thus, from now on, we only consider basis matrices of $\mathcal{S}_{Y\mid\mathbf{x}}$ that have the form $\beta = (\mathbf{I}_d, \beta_l^{T})^{T}$, where $\beta_l$ is a $(p - d) \times d$ matrix. Estimating the parameters in $\beta$ is a typical semiparametric estimation problem, in which the parameter of interest is $\mathrm{vecl}(\beta)$. Therefore we have converted the problem of estimating the central subspace $\mathcal{S}_{Y\mid\mathbf{x}}$ into a problem of semiparametric estimation.
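The parameterization can be illustrated with a minimal NumPy sketch. The helper names `normalize_basis`, `vecl` and `from_vecl` are hypothetical (they do not come from the paper), and the sketch assumes the upper $d \times d$ block of the basis is invertible. It checks numerically that two bases of the same column space map to the same free parameter vector.

```python
import numpy as np

def normalize_basis(beta):
    """Return beta @ inv(beta_u), whose upper d x d block is the identity."""
    beta = np.asarray(beta, dtype=float)
    d = beta.shape[1]
    beta_u = beta[:d, :]
    # Assumes beta_u is invertible; otherwise the covariates must be reordered first.
    return beta @ np.linalg.inv(beta_u)

def vecl(beta):
    """Stack the columns of the lower (p - d) x d block into one vector."""
    beta = np.asarray(beta, dtype=float)
    d = beta.shape[1]
    return beta[d:, :].flatten(order="F")

def from_vecl(theta, p, d):
    """Rebuild the p x d basis (I_d, beta_l^T)^T from its free parameters."""
    beta_l = np.asarray(theta, dtype=float).reshape((p - d, d), order="F")
    return np.vstack([np.eye(d), beta_l])

rng = np.random.default_rng(0)
p, d = 6, 2
beta = rng.normal(size=(p, d))
A = rng.normal(size=(d, d))          # a full-rank d x d matrix (almost surely)

theta1 = vecl(normalize_basis(beta))
theta2 = vecl(normalize_basis(beta @ A))   # same column space, different basis
rebuilt = from_vecl(theta1, p, d)
```

Here `theta1` and `theta2` agree up to rounding, and both have length $(p - d)d = 8$, which is the number of free parameters stated above.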

Remark 1. The above parameterization of $\mathcal{S}_{Y\mid\mathbf{x}}$ excludes the pathological case where one or more of the first $d$ covariates do not contribute to the model or contribute to the model through a fixed linear combination. When this happens, $\beta_u$ will be singular. However, because $\beta$ has rank $d$, one can always reorder the covariates (hence permute the rows of $\beta$) to ensure that the resulting $\beta_u$ has full rank.

2.2. Efficient score. In this section we derive the efficient score for estimating $\beta$ under the above parameterization. That is, we now consider model (2.1), where $\beta = (\mathbf{I}_d, \beta_l^{T})^{T}$ and $\mathbf{x}$ satisfies $E(\mathbf{x}) = \mathbf{0}$ and $\mathrm{var}(\mathbf{x}) = \mathbf{I}_p$. The general semiparametric technique we use originated in [2] and is wonderfully presented in [24]. Using this approach, we obtain the main result of this section, that we can use (2.2) to obtain an efficient estimation of $\beta$.

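The likelihood of one random observation $(\mathbf{x}, Y)$ in (2.1) is $\eta_1(\mathbf{x})\eta_2(Y, \beta^{T}\mathbf{x})$, where $\eta_1$ is a probability mass function (p.m.f.) or a probability density function (p.d.f.) of $\mathbf{x}$, or a mixture, depending on whether $\mathbf{x}$ contains discrete variables, and $\eta_2$ is the conditional p.m.f./p.d.f. of $Y$ on $\mathbf{x}$. We view $\eta_1, \eta_2$ as infinite dimensional nuisance parameters and $\mathrm{vecl}(\beta)$ as the $p_t$-dimensional parameter of interest. Following the semiparametric analysis procedure, we first derive the nuisance tangent space $\Lambda = \Lambda_1 \oplus \Lambda_2$, where

$$\Lambda_1 = \{\mathbf{f}(\mathbf{x}) : \forall \mathbf{f} \text{ such that } E(\mathbf{f}) = \mathbf{0}\},$$

$$\Lambda_2 = \{\mathbf{f}(Y, \beta^{T}\mathbf{x}) : \forall \mathbf{f} \text{ such that } E(\mathbf{f} \mid \mathbf{x}) = E(\mathbf{f} \mid \beta^{T}\mathbf{x}) = \mathbf{0}\}.$$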

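Here, the notation $\oplus$ means the usual addition of the two spaces $\Lambda_1$, $\Lambda_2$, while $\Lambda_1$ and $\Lambda_2$ have the extra property that they are orthogonal to each other. This means the inner product of two arbitrary functions from $\Lambda_1$ and $\Lambda_2$, respectively, calculated as the covariance between them, is zero. We then obtain its orthogonal complement

$$\Lambda^{\perp} = \{\mathbf{f}(Y, \mathbf{x}) - E(\mathbf{f} \mid \beta^{T}\mathbf{x}, Y) : E(\mathbf{f} \mid \mathbf{x}) = E(\mathbf{f} \mid \beta^{T}\mathbf{x}), \forall \mathbf{f}\}.$$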

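The detailed derivation of $\Lambda$ and $\Lambda^{\perp}$ is given in Appendix A.2 of [20]. The form of $\Lambda^{\perp}$ permits many possibilities for constructing estimating equations. For example, for arbitrary functions $\mathbf{g}_i$ and $\boldsymbol{\alpha}_i$, the linear combination

$$\sum_{i=1}^{k} \{\mathbf{g}_i(Y, \beta^{T}\mathbf{x}) - E(\mathbf{g}_i \mid \beta^{T}\mathbf{x})\}\{\boldsymbol{\alpha}_i(\mathbf{x}) - E(\boldsymbol{\alpha}_i \mid \beta^{T}\mathbf{x})\}$$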

will provide a consistent semiparametric estimator since it is a valid element in $\Lambda^{\perp}$. This form is exploited extensively in [20] to establish links between the semiparametric approach and various inverse regression methods. Among all elements in $\Lambda^{\perp}$, the most interesting one is the efficient score, defined as the orthogonal projection of the score vector $S_{\beta}$ onto $\Lambda^{\perp}$. We write the efficient score as $S_{\mathrm{eff}} = \Pi(S_{\beta} \mid \Lambda^{\perp})$. Because the efficient score can be normalized to the efficient influence function, it enables us to construct an efficient estimator of $\mathrm{vecl}(\beta)$ which reaches the optimal semiparametric efficiency bound in the sense of [2]. In the supplementary document [21], we derive the efficient score function to be

$$S_{\mathrm{eff}}(Y, \mathbf{x}, \beta^{T}\mathbf{x}, \eta_2) = \mathrm{vecl}\left[\{\mathbf{x} - E(\mathbf{x} \mid \beta^{T}\mathbf{x})\}\frac{\partial \log\{\eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial(\mathbf{x}^{T}\beta)}\right]. \tag{2.2}$$

Hypothetically, the efficient estimator can be obtained through implementing

$$\sum_{i=1}^{n} S_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \eta_2) = \mathbf{0}.$$

However, $S_{\mathrm{eff}}$ is not readily implementable because it contains the unknown quantities $E(\mathbf{x} \mid \mathbf{x}^{T}\beta)$ and $\partial \log \eta_2(Y, \beta^{T}\mathbf{x})/\partial(\mathbf{x}^{T}\beta)$. For this reason, we first discuss a simpler alternative in the following section.

3. Locally efficient and efficient estimators. 

3.1. Locally efficient estimators. We now discuss how to construct a locally efficient estimator. This is an estimator that contains some subjectively chosen components. If the components are "well" chosen, the resulting estimator is efficient. Otherwise, it is not efficient, but still consistent. The efficient estimator defined in (2.2) requires one to estimate $\eta_2$, the conditional p.d.f. of $Y$ on $\beta^{T}\mathbf{x}$, and its first derivative with respect to $\beta^{T}\mathbf{x}$. Although this is feasible, as we will describe in detail in Section 3.2, it certainly is not a trivial task as it involves several nonparametric estimations. Because of this, a compromise is to consider an estimator that depends on a posited model of $\eta_2$. Specifically, we would choose some favorite form for $\eta_2$, denoted $\eta_2^{*}(Y, \beta^{T}\mathbf{x})$, and utilize it in place of $\eta_2$ to construct an estimating equation. If the posited model is correct (i.e., $\eta_2^{*} = \eta_2$), then we would have the optimal efficiency using the corresponding $S^{*}_{\mathrm{eff}}$. However, even if the posited model is incorrect (i.e., $\eta_2^{*} \ne \eta_2$), we would still have consistency using the corresponding $S^{*}_{\mathrm{eff}}$. A valid choice of $S^{*}_{\mathrm{eff}}$ that indeed guarantees such property is

$$S^{*}_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \eta_2^{*}) = \mathrm{vecl}\left(\{\mathbf{x}_i - E(\mathbf{x}_i \mid \beta^{T}\mathbf{x}_i)\}\left[\frac{\partial \log\{\eta_2^{*}(Y_i, \beta^{T}\mathbf{x}_i)\}}{\partial(\mathbf{x}_i^{T}\beta)} - E\left\{\frac{\partial \log \eta_2^{*}(Y_i, \beta^{T}\mathbf{x}_i)}{\partial(\mathbf{x}_i^{T}\beta)} \,\Big|\, \beta^{T}\mathbf{x}_i\right\}\right]\right).$$

When $\eta_2^{*} = \eta_2$, $E\{\partial \log \eta_2^{*}(Y_i, \beta^{T}\mathbf{x}_i)/\partial(\mathbf{x}_i^{T}\beta) \mid \beta^{T}\mathbf{x}_i\} = \mathbf{0}$, hence $S^{*}_{\mathrm{eff}} = S_{\mathrm{eff}}$. The construction of a locally efficient estimator is often useful in practice due to its relative simplicity. $S^{*}_{\mathrm{eff}}$ is almost readily applicable except that the two expectations $E(\mathbf{x}_i \mid \beta^{T}\mathbf{x}_i)$ and $E\{\partial \log \eta_2^{*}(Y_i, \beta^{T}\mathbf{x}_i)/\partial(\mathbf{x}_i^{T}\beta) \mid \beta^{T}\mathbf{x}_i\}$ need to be estimated nonparametrically. One can use the familiar kernel or local polynomial estimators. In Theorem 1, we show that under mild conditions, with the two expectations estimated via the Nadaraya-Watson kernel estimators, the local efficiency property indeed holds and estimating the two expectations does not cause any difference from knowing them in terms of its first order asymptotic property.
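As a rough illustration of the nonparametric step, the following NumPy sketch computes a Nadaraya-Watson estimate of the conditional mean of $\mathbf{x}$ given the index $\beta^{T}\mathbf{x}$. The function names, the Epanechnikov kernel, the bandwidth value and the simulated normal design are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def epanechnikov(t):
    """Univariate kernel with compact support [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def product_kernel(diffs, h):
    """K_h(u) = h^{-d} prod_j K(u_j / h), applied along the last axis."""
    d = diffs.shape[-1]
    return np.prod(epanechnikov(diffs / h), axis=-1) / h**d

def nw_conditional_mean(X, U, U0, h):
    """Nadaraya-Watson estimate of E(x | beta^T x = u0) for each row u0 of U0."""
    diffs = U[None, :, :] - U0[:, None, :]      # shape (m, n, d)
    W = product_kernel(diffs, h)                # shape (m, n)
    return (W @ X) / W.sum(axis=1, keepdims=True)

# Illustrative data: for standard normal x, E(x | beta^T x = u) = beta u / (beta^T beta).
rng = np.random.default_rng(0)
n = 20000
beta = np.array([1.0, 0.5, -0.5])
X = rng.standard_normal((n, 3))
U = (X @ beta)[:, None]                         # index values, d = 1

u0 = 0.5
est = nw_conditional_mean(X, U, np.array([[u0]]), h=0.3)[0]
expected = beta * u0 / (beta @ beta)
```

With a normal design the conditional mean is linear in the index, so `est` can be compared with the closed form `expected`. For other designs no closed form is generally available, which is why a kernel estimate is used.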

We first present the regularity conditions needed for the theoretical de- 
velopment. 

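(A1) (The posited conditional density $\eta_2^{*}$). Denote $\mathbf{u} = \beta^{T}\mathbf{x}$. The posited conditional density $\eta_2^{*}(Y, \mathbf{u})$ of $Y$ given $\mathbf{u}$ is bounded away from 0 and infinity on its support $\mathcal{Y}$. The second derivative of $\log \eta_2^{*}(Y, \mathbf{u})$ with respect to $\mathbf{u}$ is continuous, positive definite and bounded. In addition, there is an open set $\Omega \subset \mathbb{R}^{p_t}$ which contains the true parameter $\mathrm{vecl}(\beta)$, such that the third derivative of $\eta_2^{*}(Y, \beta^{T}\mathbf{x})$ satisfies

$$\left|\frac{\partial^{3}\{\eta_2^{*}(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)_j \, \partial \mathrm{vecl}(\beta)_k \, \partial \mathrm{vecl}(\beta)_l}\right| \le M^{*}_{jkl}(Y, \mathbf{x})$$

for all $\mathrm{vecl}(\beta) \in \Omega$ and $1 \le j, k, l \le p_t$, where $M^{*}_{jkl}(Y, \mathbf{x})$ satisfies $E\{M^{*2}_{jkl}(Y, \mathbf{x})\} < \infty$, and $\mathrm{vecl}(\beta)_j$ is the $j$th component of $\mathrm{vecl}(\beta)$.

(A2) (The nonparametric estimation). $E\{\partial \log \eta_2^{*}(Y, \beta^{T}\mathbf{x})/\partial(\mathbf{x}^{T}\beta) \mid \beta^{T}\mathbf{x}\}$ and $E(\mathbf{x} \mid \beta^{T}\mathbf{x})$ are estimated via the Nadaraya-Watson kernel estimator. For simplicity, a common bandwidth $h$ is used which satisfies $nh^{8} \to 0$ and $nh^{2d} \to \infty$ as $n \to \infty$.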

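(B1) (The true conditional density $\eta_2$). The true conditional density $\eta_2(Y, \mathbf{u})$ of $Y$ given $\mathbf{u}$ is bounded away from 0 and infinity on its support $\mathcal{Y}$. The first and second derivatives of $\log \eta_2$ satisfy

$$E\left[\frac{\partial\{\log \eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)}\right] = \mathbf{0}$$

and

$$E\left[\frac{\partial\{\log \eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)}\frac{\partial\{\log \eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)^{T}}\right] = -E\left[\frac{\partial^{2}\{\log \eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)\,\partial \mathrm{vecl}(\beta)^{T}}\right]$$

is positive definite and bounded. In addition, there is an open set $\Omega \subset \mathbb{R}^{p_t}$ which contains the true parameter $\mathrm{vecl}(\beta)$, such that the third derivative of $\eta_2(Y, \beta^{T}\mathbf{x})$ satisfies

$$\left|\frac{\partial^{3}\{\eta_2(Y, \beta^{T}\mathbf{x})\}}{\partial \mathrm{vecl}(\beta)_j \, \partial \mathrm{vecl}(\beta)_k \, \partial \mathrm{vecl}(\beta)_l}\right| \le M_{jkl}(Y, \mathbf{x})$$

for all $\mathrm{vecl}(\beta) \in \Omega$ and $1 \le j, k, l \le p_t$, where $M_{jkl}(Y, \mathbf{x})$ satisfies $E\{M^{2}_{jkl}(Y, \mathbf{x})\} < \infty$, and $\mathrm{vecl}(\beta)_j$ is the $j$th component of $\mathrm{vecl}(\beta)$.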

(B2) (The bandwidths). The bandwidths satisfy $h_y \to 0$, $b \to 0$ and $h_x \to 0$, and $nh_y^{d+2}b \to \infty$, $n^{1/2}\{h_x^{2} + (nh_x^{d})^{-1/2}\}\{h_y^{2} + b^{2} + (nh_y^{d+2}b)^{-1/2}\} \to 0$.

(C1) (The density functions of covariates). Let $\mathbf{u} = \beta^{T}\mathbf{x}$. The density functions of $\mathbf{u}$ and $\mathbf{x}$ are bounded away from 0 and infinity on their supports $\mathcal{U}$ and $\mathcal{X}$, where $\mathcal{U} = \{\mathbf{u} = \beta^{T}\mathbf{x} : \mathbf{x} \in \mathcal{X}\}$ and $\mathcal{X}$ is a compact support set of $\mathbf{x}$. Their second derivatives are finite on their supports.

(C2) (The smoothness). The regression function $E(\mathbf{x} \mid \mathbf{u})$ has a bounded and continuous derivative on $\mathcal{U}$.

(C3) (The kernel function). The univariate kernel function $K(\cdot)$ is a bounded symmetric probability density function, has a bounded derivative and compact support $[-1, 1]$, and satisfies $\mu_2 = \int u^{2}K(u)\,du \ne 0$. The $d$-dimensional kernel function is a product of $d$ univariate kernel functions, that is, $K(\mathbf{u}) = \prod_{j=1}^{d} K(u_j)$, and $K_h(\mathbf{u}) = \prod_{j=1}^{d} K_h(u_j) = h^{-d}\prod_{j=1}^{d} K(u_j/h)$ for $\mathbf{u} = (u_1, \ldots, u_d)^{T}$ and any bandwidth $h$.
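Theorem 1. Under conditions (A1)-(A2) and (C1)-(C3), the estimator obtained from the estimating equation

$$\sum_{i=1}^{n} S^{*}_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \eta_2^{*}, \widehat{E}) = \mathbf{0}$$

is locally efficient. Specifically, the estimator remains consistent when $\eta_2^{*} \ne \eta_2$, and is efficient if $\eta_2^{*} = \eta_2$. In addition, using the estimated $\widehat{E}(\cdot \mid \beta^{T}\mathbf{x})$ results in the same estimation variance for $\mathrm{vecl}(\beta)$ as using the true $E(\cdot \mid \beta^{T}\mathbf{x})$. Specifically, the estimate $\widehat{\beta}$ satisfies

$$\sqrt{n}\{\mathrm{vecl}(\widehat{\beta}) - \mathrm{vecl}(\beta)\} \to N\{\mathbf{0}, \mathbf{A}^{-1}\mathbf{B}(\mathbf{A}^{-1})^{T}\}$$

when $n \to \infty$, where

$$\mathbf{A} = E\left\{\frac{\partial S^{*}_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \eta_2^{*})}{\partial \mathrm{vecl}(\beta)^{T}}\right\}, \qquad \mathbf{B} = E\{S^{*}_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \eta_2^{*})^{\otimes 2}\}.$$

In Theorem 1 and thereafter, we use $\mathbf{v}^{\otimes 2}$ to denote $\mathbf{v}\mathbf{v}^{T}$ for any matrix or vector $\mathbf{v}$, and use $\widehat{E}$ to denote the nonparametrically estimated expectation.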

We describe how to implement the locally efficient estimator in several specific cases. For example, when $Y$ is continuous, we can propose a simple conditional normal model for $\eta_2$ and hence obtain the locally efficient estimator based on summing terms of the form

$$S^{*}_{\mathrm{eff}}(Y, \mathbf{x}, \beta^{T}\mathbf{x}, \eta_2^{*}) = \mathrm{vecl}\left(\{\mathbf{x} - E(\mathbf{x} \mid \beta^{T}\mathbf{x})\}\{Y - E^{*}(Y \mid \beta^{T}\mathbf{x})\}\frac{\partial E^{*}(Y \mid \beta^{T}\mathbf{x})}{\partial(\mathbf{x}^{T}\beta)}\right) \tag{3.1}$$

evaluated at different observations. Here $E^{*}(\cdot \mid \beta^{T}\mathbf{x})$ is computed using the model $\eta_2^{*}$. When $Y$ is binary, a common model to posit for $\eta_2$ is a logistic model. The summation of the terms of form (3.1) evaluated at different observations also provides a locally efficient estimator. When $Y$ is a count response variable, the Poisson model is a popular choice for $\eta_2^{*}$. This choice also yields a locally efficient estimator formed by the sum of (3.1). The benefits of these locally efficient estimators are two-fold. The first benefit lies in the robustness property, in that they guarantee the consistency of the resulting estimators regardless of the proposed model. The second benefit is their computational simplicity gained through avoiding estimating the conditional density $\eta_2$ and its derivative. In addition, if, by luck, the posited model happens to be correct, then the estimator is efficient.

Remark 2. We have restricted the posited model $\eta_2^{*}$ to be a completely known model in order to illustrate the local efficiency concept. In fact, one can also posit a model $\eta_2^{*}$ that contains an additional unknown parameter vector, say $\gamma$. As long as $\gamma$ can be estimated at the root-$n$ rate, the resulting estimator with the estimate $\widehat{\gamma}$ plugged in is also referred to as a locally efficient estimator. In addition, if model $\eta_2^{*}$ contains the true $\eta_2$, say $\eta_2^{*}(Y, \beta^{T}\mathbf{x}, \gamma_0) = \eta_2(Y, \beta^{T}\mathbf{x})$, and $\gamma_0$ is estimated consistently by $\widehat{\gamma}$ at the root-$n$ rate, then the resulting estimator based on $S^{*}_{\mathrm{eff}}$ with $\eta_2^{*}(Y, \beta^{T}\mathbf{x}, \widehat{\gamma})$ plugged in is efficient.

Remark 3. Even if efficiency is not sought after and consistency is the sole purpose, at least one nonparametric operation, such as one that relates to estimating $E(\mathbf{x} \mid \beta^{T}\mathbf{x})$, is needed. Thus, to completely avoid nonparametric procedures, the only option is to impose additional assumptions. The most popular linearity condition in the literature assumes $E(\mathbf{x} \mid \beta^{T}\mathbf{x}) = \beta(\beta^{T}\beta)^{-1}\beta^{T}\mathbf{x}$. Since Theorem 1 allows an arbitrary $\eta_2^{*}$, the most obvious choice in practice is probably the exponential link functions. For example, if we choose $\eta_2^{*}$ to be the normal link function when $d = 1$, then the locally efficient estimator degenerates to a simple form, where

$$S^{*}_{\mathrm{eff}} = \mathrm{vecl}[\{\mathbf{x} - \beta(\beta^{T}\beta)^{-1}\beta^{T}\mathbf{x}\}(Y - \beta^{T}\mathbf{x})].$$

If we are even bolder and decide to replace $Y - \beta^{T}\mathbf{x}$ with $Y$, which is still valid given that the first term alone already guarantees consistency under the linearity condition, then we obtain the ordinary least squares estimator [13]. Further connections to other existing methods are elaborated in [20].
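The simple form in Remark 3 can be checked numerically. The sketch below is illustrative only: the function name `score_linearity_normal`, the standard normal design (for which the linearity condition holds) and the linear model for $Y$ are assumptions of the example. It evaluates the sample average of the score for $d = 1$ at the true $\beta$ and at a perturbed value.

```python
import numpy as np

def score_linearity_normal(beta, X, Y):
    """Sample average of vecl[{x - P x}(Y - beta^T x)] for d = 1,
    where P = beta (beta^T beta)^{-1} beta^T (linearity condition assumed)."""
    beta = np.asarray(beta, dtype=float).reshape(-1, 1)     # p x 1
    P = beta @ np.linalg.solve(beta.T @ beta, beta.T)       # projection onto span(beta)
    resid_x = X - X @ P                                     # rows x_i - P x_i (P is symmetric)
    resid_y = Y - (X @ beta).ravel()
    terms = resid_x * resid_y[:, None]
    return terms[:, 1:].mean(axis=0)                        # vecl keeps the lower p - d rows

rng = np.random.default_rng(0)
n = 20000
beta_true = np.array([1.0, 0.5, -1.0, 2.0])                 # upper block equals I_1
X = rng.standard_normal((n, 4))
Y = X @ beta_true + rng.standard_normal(n)

score_at_truth = score_linearity_normal(beta_true, X, Y)
beta_wrong = beta_true + np.array([0.0, 0.5, 0.5, 0.5])
score_at_wrong = score_linearity_normal(beta_wrong, X, Y)
```

In this setting the averaged score is close to zero at the true parameter and clearly away from zero at the perturbed one, which is the behavior an estimating equation needs.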

3.2. The efficient estimator. Now we pursue the truly efficient estimator that reaches the semiparametric efficiency bound. This is important because in terms of reaching the optimal efficiency, relying on a posited model $\eta_2^{*}$ to be true or to contain the true $\eta_2$ is not a satisfying practice. Intuitively, it is easy to imagine that in constructing the locally efficient estimator, if we posit a larger model $\eta_2^{*}$, the chance of it containing the true model $\eta_2$ becomes larger, hence the chance of reaching the optimal efficiency also increases. Thus, if we can propose the "largest" possible model for $\eta_2^{*}$, we will guarantee to have $\eta_2^{*}$ containing $\eta_2$. If we can also estimate the parameters in $\eta_2^{*}$ "correctly," we will then guarantee the efficiency. This "largest" model with a "correctly" estimated parameter turns out to be what the nonparametric estimation is able to provide. This amounts to estimating $E(\mathbf{x} \mid \beta^{T}\mathbf{x})$, $\eta_2$ and its first derivative nonparametrically in (2.2).

We first discuss how to estimate $\eta_2$ and its first derivative, based on $(Y_i, \beta^{T}\mathbf{x}_i)$, $i = 1, \ldots, n$. This is a problem of estimating conditional density and its derivative. We use the idea of the "double-kernel" local linear smoothing method studied in [8]. Consider $K_b(Y - y) = b^{-1}K\{(Y - y)/b\}$ with $y$ running through all possible values, where $K(\cdot)$ is a symmetric density function, and $b > 0$ is a bandwidth. Then $E\{K_b(Y - y) \mid \beta^{T}\mathbf{x}\}$ converges to $\eta_2(y, \beta^{T}\mathbf{x})$ as $b$ tends to 0. This observation motivates us to estimate $\eta_2$ and its first derivative, evaluated at $(y, \beta^{T}\mathbf{x})$, through minimizing the following weighted least squares:

$$\sum_{i=1}^{n} \{K_b(Y_i - y) - a - \mathbf{b}^{T}(\beta^{T}\mathbf{x}_i - \beta^{T}\mathbf{x})\}^{2} K_{h_y}(\beta^{T}\mathbf{x}_i - \beta^{T}\mathbf{x}),$$

where $h_y$ is a bandwidth, and $K_{h_y}$ is a multivariate kernel function. The minimizers $\widehat{a}$ and $\widehat{\mathbf{b}}$ are the estimators of $\eta_2$ and $\partial\eta_2/\partial(\beta^{T}\mathbf{x})$. Let the resulting estimators be $\widehat{\eta}_2(\cdot)$ and $\widehat{\eta}_2'(\cdot)$.
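The double-kernel local linear fit can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name, the Epanechnikov kernel, the bandwidth values and the simulated model $Y \mid u \sim N(u, 1)$ are choices made for the example and are not taken from the paper or from [8].

```python
import numpy as np

def epanechnikov(t):
    """Symmetric density with compact support [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def double_kernel_local_linear(Y, U, y0, u0, b, h):
    """Estimate eta_2(y0, u0) and its gradient in u by weighted least squares.

    Y: (n,) responses; U: (n, d) index values beta^T x_i;
    b: bandwidth in the y direction; h: bandwidth in the u direction.
    """
    n, d = U.shape
    D = U - u0                                            # centered index values
    w = np.prod(epanechnikov(D / h), axis=1) / h**d       # K_h(u_i - u0)
    r = epanechnikov((Y - y0) / b) / b                    # K_b(Y_i - y0), the "response"
    Z = np.column_stack([np.ones(n), D])                  # local linear design matrix
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], r * sw, rcond=None)
    return coef[0], coef[1:]                              # intercept a, slope vector

# Illustrative check with Y | u ~ N(u, 1), so eta_2(y, u) is the N(0, 1) density at y - u.
rng = np.random.default_rng(0)
n = 20000
U = rng.uniform(-1.0, 1.0, size=(n, 1))
Y = U[:, 0] + rng.standard_normal(n)

y0, u0 = 0.3, np.array([0.2])
eta_hat, grad_hat = double_kernel_local_linear(Y, U, y0, u0, b=0.3, h=0.3)
z = y0 - u0[0]
eta_true = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
grad_true = z * eta_true                                  # derivative of eta_2 in u
```

The intercept estimates the conditional density and the slope estimates its derivative with respect to the index. The derivative estimate is noticeably noisier than the level estimate, which is consistent with the stronger bandwidth requirement involving $nh_y^{d+2}b$.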

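It remains to estimate $E(\mathbf{x} \mid \beta^{T}\mathbf{x})$. Using the Nadaraya-Watson kernel estimator, we have

$$\widehat{E}(\mathbf{x} \mid \beta^{T}\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{x}_i K_{h_x}(\beta^{T}\mathbf{x}_i - \beta^{T}\mathbf{x})}{\sum_{i=1}^{n} K_{h_x}(\beta^{T}\mathbf{x}_i - \beta^{T}\mathbf{x})},$$

where $h_x$ is a bandwidth, and $K_{h_x}$ is a multivariate kernel function. The algorithm for obtaining the efficient estimator is the following: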



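Step 1. Obtain an initial root-$n$ consistent estimator of $\beta$, denoted as $\widetilde{\beta}$, through, for example, a simple locally efficient estimation procedure from Section 3.1.

Step 2. Perform nonparametric estimation of $\eta_2(Y, \widetilde{\beta}^{T}\mathbf{x})$ and its first derivative $\partial\{\eta_2(Y, \widetilde{\beta}^{T}\mathbf{x})\}/\partial(\widetilde{\beta}^{T}\mathbf{x})$. Write the resulting estimators as $\widehat{\eta}_2(\cdot)$ and $\widehat{\eta}_2'(\cdot)$.

Step 3. Perform nonparametric estimation of $E(\mathbf{x} \mid \widetilde{\beta}^{T}\mathbf{x})$. Write the resulting estimator as $\widehat{E}(\cdot)$.

Step 4. Plug $\widehat{\eta}_2(Y, \beta^{T}\mathbf{x})$, $\widehat{\eta}_2'(Y, \beta^{T}\mathbf{x})$ and $\widehat{E}(\mathbf{x} \mid \beta^{T}\mathbf{x})$ into $S_{\mathrm{eff}}$ and solve the estimating equation

$$\sum_{i=1}^{n} S_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \widehat{\eta}_2, \widehat{\eta}_2', \widehat{E}) = \mathbf{0}$$

to obtain the efficient estimator $\widehat{\beta}$.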
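Steps 2 to 4 can be sketched for $d = 1$ as a single function that evaluates the plug-in sum of efficient scores at a given parameter value. This is only a rough sketch under several assumptions that are not in the paper: the function name is hypothetical, Gaussian kernels are used for numerical convenience (they do not have the compact support required in (C3)), the bandwidths are arbitrary illustrative values, and the density estimate is clipped away from zero as an ad hoc safeguard.

```python
import numpy as np

def gauss(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def efficient_score_sum(beta_l, X, Y, hx, hy, b):
    """Plug-in sum of S_eff over observations for d = 1 at a given beta_l."""
    beta = np.concatenate([[1.0], beta_l])        # upper block fixed at I_1
    u = X @ beta
    D = u[None, :] - u[:, None]                   # D[i, j] = u_j - u_i
    # Step 3: Nadaraya-Watson estimate of E(x_l | u_i).
    Wx = gauss(D / hx)
    Ex = (Wx @ X[:, 1:]) / Wx.sum(axis=1, keepdims=True)
    # Step 2: double-kernel local linear fit at each (Y_i, u_i), in closed form.
    W = gauss(D / hy) / hy
    R = gauss((Y[None, :] - Y[:, None]) / b) / b  # K_b(Y_j - Y_i)
    S0, S1, S2 = W.sum(axis=1), (W * D).sum(axis=1), (W * D**2).sum(axis=1)
    T0, T1 = (W * R).sum(axis=1), (W * D * R).sum(axis=1)
    det = S0 * S2 - S1**2
    eta = (S2 * T0 - S1 * T1) / det               # estimate of eta_2(Y_i, u_i)
    deta = (S0 * T1 - S1 * T0) / det              # estimate of its derivative in u
    eta = np.maximum(eta, 1e-8)                   # ad hoc guard against nonpositive estimates
    # Step 4: {x_l,i - E(x_l | u_i)} times d log eta_2 / d u, summed over i.
    return ((X[:, 1:] - Ex) * (deta / eta)[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
n, p = 300, 4
beta_l_true = np.array([0.5, -0.5, 1.0])
X = rng.standard_normal((n, p))
Y = X @ np.concatenate([[1.0], beta_l_true]) + rng.standard_normal(n)

score = efficient_score_sum(beta_l_true, X, Y, hx=0.5, hy=0.5, b=0.6)
```

To complete Step 4 one would pass this function to a generic root finder or minimize the squared norm of its output, starting from the initial estimate of Step 1. That part is omitted here because the appropriate solver and stopping rule are implementation choices.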

In performing the various nonparametric estimations in steps 2 and 3, as well as in obtaining the locally efficient estimator in Section 3.1, bandwidths need to be selected. Because the final estimator is very insensitive to the bandwidths, as indicated by conditions (A2), (B2) and Theorems 1 and 2, where a range of different bandwidths all lead to the same asymptotic property of the final estimator, we suggest that one should select the corresponding bandwidths by taking the sample size $n$ to a suitable power that satisfies (B2), and then multiplying by a constant to scale it, instead of performing a full-scale cross validation procedure. For example, when $d = 1$, we let $h = n^{-[\ldots]}$, $h_x = n^{-[\ldots]}$, $h_y = n^{-[\ldots]}$, $b = n^{-[\ldots]}$, and when $d = 2$, we let $h = n^{-[\ldots]}$, $h_x = n^{-[\ldots]}$, $h_y = n^{-[\ldots]}$, $b = n^{-[\ldots]}$, each multiplied by the standard deviation of the regressors calculated at the current $\beta$ value.

The estimator from the above algorithm, $\widehat{\beta}$, with its upper $d \times d$ submatrix being $\mathbf{I}_d$, reaches the optimal semiparametric efficiency bound. We present this result in Theorem 2.

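Theorem 2. Under conditions (B1)-(B2) and (C1)-(C3), the estimator obtained from the estimating equation

$$\sum_{i=1}^{n} S_{\mathrm{eff}}(Y_i, \mathbf{x}_i, \beta^{T}\mathbf{x}_i, \widehat{\eta}_2, \widehat{\eta}_2', \widehat{E}) = \mathbf{0}$$

is efficient. Specifically, when $n \to \infty$, the estimator of $\mathrm{vecl}(\beta)$ satisfies

$$\sqrt{n}\{\mathrm{vecl}(\widehat{\beta}) - \mathrm{vecl}(\beta)\} \to N(\mathbf{0}, [E\{S_{\mathrm{eff}}(Y, \mathbf{x}, \beta^{T}\mathbf{x}, \eta_2)^{\otimes 2}\}]^{-1})$$

in distribution.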

Remark 4. It is known that for certain p.d.f. $\eta_2$, such as when the inverse mean function $E(\mathbf{x} \mid Y)$ degenerates, some inverse regression-based methods, such as SIR, would fail to exhaustively recover $\mathcal{S}_{Y\mid\mathbf{x}}$. However, this is not the case for the efficient estimator proposed here. That is, our proposed efficient estimator, similar to dMAVE [26], has the exhaustiveness property [11]. In fact, as listed in the regularity conditions, as long as the asymptotic covariance matrix is not singular and is bounded away from infinity, our method is always able to produce the efficient estimator.

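Remark 5. It can be easily verified that the matrix $E\{S_{\mathrm{eff}}^{\otimes 2}\}$ appearing in the above efficient asymptotic variance-covariance matrix can be explicitly written out as

$$E\{S_{\mathrm{eff}}(Y, \mathbf{x}, \beta^{T}\mathbf{x}, \eta_2)^{\otimes 2}\} = E\left(E\left[\left\{\frac{\partial \log \eta_2(Y, \beta^{T}\mathbf{x})}{\partial(\beta^{T}\mathbf{x})}\right\}^{\otimes 2} \Big|\, \beta^{T}\mathbf{x}\right] \otimes E[\{\mathbf{x}_l - E(\mathbf{x}_l \mid \beta^{T}\mathbf{x})\}^{\otimes 2} \mid \beta^{T}\mathbf{x}]\right),$$

where $\mathbf{x}_l$ is the vector formed by the lower $p - d$ components of $\mathbf{x}$. Thus, the asymptotic variance of $\mathrm{vecl}(\widehat{\beta})$ is nonsingular as long as both $E[\{\partial \log \eta_2(Y, \beta^{T}\mathbf{x})/\partial(\beta^{T}\mathbf{x})\}^{\otimes 2} \mid \beta^{T}\mathbf{x}]$ and $E[\{\mathbf{x}_l - E(\mathbf{x}_l \mid \beta^{T}\mathbf{x})\}^{\otimes 2} \mid \beta^{T}\mathbf{x}]$ are nonsingular.

The nonsingularity of the first matrix is a standard requirement on the information matrix of the true model $\eta_2$ and is usually satisfied. On the other hand, $E(E[\{\mathbf{x}_l - E(\mathbf{x}_l \mid \beta^{T}\mathbf{x})\}^{\otimes 2} \mid \beta^{T}\mathbf{x}])$ is always guaranteed to be nonsingular. This is because if it is singular, then there exists a unit vector $\mathbf{a}$ with the first $d$ components zero, such that $\mathbf{a}^{T}\mathbf{x}$ is a deterministic function of $\beta^{T}\mathbf{x}$. This violates our assumption that $\mathbf{a}^{T}\mathbf{x}$ cannot be a deterministic function of $\beta^{T}\mathbf{x}$ unless $\mathbf{a}$ lies within the column space of $\beta$.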
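The Kronecker structure used in Remark 5 rests on a simple algebraic identity, which the following short NumPy check illustrates. The variable names are hypothetical: `a` stands for the centered lower covariates and `g` for the derivative of the log density with respect to the index.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 2, 4                        # q plays the role of p - d
g = rng.normal(size=d)             # stands for d log eta_2 / d (beta^T x)
a = rng.normal(size=q)             # stands for x_l - E(x_l | beta^T x)

# vecl of the (p - d) x d matrix a g^T stacks its columns: this equals g kron a.
score = np.outer(a, g).flatten(order="F")

lhs = np.outer(score, score)                       # score^{otimes 2}
rhs = np.kron(np.outer(g, g), np.outer(a, a))      # g^{otimes 2} kron a^{otimes 2}
```

Since `lhs` and `rhs` coincide for every realization, taking conditional expectations given the index separates the two factors, because given $\mathbf{x}$ the distribution of $Y$ depends on $\mathbf{x}$ only through the index.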
4. Simulation study. In this section we conduct simulations to evaluate
the finite sample performance of our efficient and locally efficient estimators
and compare them with several existing methods.

We consider the following three examples:

(1) We generate $Y$ from a normal population with mean function $\mathbf{x}^{T}\beta$ and variance 1.

(2) We generate $Y$ from a normal population with mean function $\sin(2\mathbf{x}^{T}\beta) + 2\exp(2 + \mathbf{x}^{T}\beta)$ and variance function $\log\{2 + (\mathbf{x}^{T}\beta)^{2}\}$.

(3) We generate $Y$ from a normal population with mean function $2(\mathbf{x}^{T}\beta_1)^{2}$ and variance function $2\exp(\mathbf{x}^{T}\beta_2)$.

In the simulated examples 1 and 2, we set $\beta = (1.3, -1.3, 1.0, -0.5, 0.5, -0.5)^{T}$ and generate $\mathbf{x} = (X_1, \ldots, X_6)^{T}$ as follows. We generate $X_1$, $X_2$, $e_1$ and $e_2$ independently from a standard normal distribution, and form $X_3 = 0.2X_1 + 0.2(X_2 + 2)^{2} + 0.2e_1$, $X_4 = 0.1 + 0.1(X_1 + X_2) + 0.3(X_1 + 1.5)^{2} + 0.2e_2$. We generate $X_5$ and $X_6$ independently from Bernoulli distributions with success probability $\exp(X_1)/\{1 + \exp(X_1)\}$ and $\exp(X_2)/\{1 + \exp(X_2)\}$, respectively.
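The data-generating mechanism of example 1 can be written down directly. The sketch below follows the description above; the function names and the seed are assumptions of the example, and no standardization of the covariates is applied.

```python
import numpy as np

def generate_covariates(n, rng):
    """Covariates of simulated examples 1 and 2: four continuous, two binary."""
    x1, x2, e1, e2 = rng.standard_normal((4, n))
    x3 = 0.2 * x1 + 0.2 * (x2 + 2.0) ** 2 + 0.2 * e1
    x4 = 0.1 + 0.1 * (x1 + x2) + 0.3 * (x1 + 1.5) ** 2 + 0.2 * e2
    # exp(x) / {1 + exp(x)} written as 1 / {1 + exp(-x)}
    x5 = rng.binomial(1, 1.0 / (1.0 + np.exp(-x1)))
    x6 = rng.binomial(1, 1.0 / (1.0 + np.exp(-x2)))
    return np.column_stack([x1, x2, x3, x4, x5, x6])

def generate_example1(n, rng):
    """Example 1: Y is normal with mean x^T beta and variance 1."""
    beta = np.array([1.3, -1.3, 1.0, -0.5, 0.5, -0.5])
    X = generate_covariates(n, rng)
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta

X, Y, beta = generate_example1(500, np.random.default_rng(0))
```

The design mixes nonlinear dependence among continuous covariates with binary covariates, so neither the linearity condition nor the constant variance condition can be expected to hold.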

Example 3 follows the setup of Example 4.2 in [26]. In this example, we set $\beta_1 = (1, 2/3, 2/3, 0, -1/3, 2/3)^{T}$ and $\beta_2 = (0.8, 0.8, -0.3, 0.3, 0, 0)^{T}$. We form the covariates $\mathbf{x}$ by setting $X_1 = U_1 - U_2$, $X_2 = U_2 - U_3 - U_4$, $X_3 = U_3 + U_4$, $X_4 = 2U_4$, $X_5 = U_5 + 0.5U_6$ and $X_6 = U_6$, where $U_1$ is generated from a Bernoulli distribution with probability 0.5 to be 1 or $-1$, and $U_2$ is also generated from a Bernoulli distribution, with probability 0.7 to be $\sqrt{3/7}$ and probability 0.3 to be $-\sqrt{7/3}$. The remaining four components of $\mathbf{u}$ are generated from a uniform distribution between $-\sqrt{3}$ and $\sqrt{3}$. The six components of $\mathbf{u} = (U_1, \ldots, U_6)^{T}$ are independent, marginally having zero mean and unit variance. We construct $\mathbf{x}$ through $\mathbf{u}$ in this way to allow the components of $\mathbf{x}$ to be correlated.
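The zero mean and unit variance claim for the components of $\mathbf{u}$ can be verified by a short calculation, shown here for $U_2$ and for a uniform component $U_j$, $j \ge 3$:

```latex
\begin{align*}
E(U_2) &= 0.7\sqrt{3/7} - 0.3\sqrt{7/3}
        = 0.1\sqrt{21} - 0.1\sqrt{21} = 0,\\
\operatorname{var}(U_2) &= E(U_2^{2})
        = 0.7\cdot\tfrac{3}{7} + 0.3\cdot\tfrac{7}{3} = 0.3 + 0.7 = 1,\\
E(U_j) &= 0, \qquad
\operatorname{var}(U_j) = \frac{(2\sqrt{3})^{2}}{12} = 1 .
\end{align*}
```

The same holds trivially for $U_1$, which takes the values $1$ and $-1$ with equal probability.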

For the purpose of comparison, we implement six estimators: "Oracle," 
"Eff," "Local," "dMAVE," "SIR" and "DR." The names of the estimators 
suggest the nature of these estimators, while we briefly explain them in the 
following: 

Oracle: the oracle estimate which correctly specifies r/2 in (2.2), but we esti- 
mate E[x.\(3 x) through kernel regressions. We remark here that the or- 
acle estimator is not a realistic estimator because rj2 is usually unknown. 
We include the oracle estimator here to provide a benchmark since this 
is the best performance one could hope for. 

Eff: the efficient estimator, which estimates $E(\mathbf{x} \mid \beta^{\mathrm T}\mathbf{x})$, $\eta_2$ and $\eta_2'$ 
through nonparametric regressions. See Section 3.2 for a description of 
this efficient estimator. 

Table 1 
The average ("ave") and the sample standard errors ("std") for various estimates, and 
the inference results, respectively, the average of the estimated standard deviation 
("$\widehat{std}$") and the coverage of the estimated 95% confidence interval ("95%"), 
of the oracle estimator and the efficient estimator, of $\beta$ in simulated example 1 

Estimator  Statistic          $\beta_1$  $\beta_2$  $\beta_3$  $\beta_4$  $\beta_5$  $\beta_6$
True                                1.3       -1.3          1       -0.5        0.5       -0.5
Oracle     ave                   1.2978    -1.3036     1.0049    -0.4985     0.5033    -0.4943
           std                   0.1221     0.1477     0.1505     0.1169     0.0966     0.1049
           $\widehat{std}$       0.1264     0.1510     0.1527     0.1212     0.0983     0.1052
           95%                   0.9510     0.9540     0.9440     0.9540     0.9520     0.9450
Eff        ave                   1.2980    -1.3046     1.0064    -0.4990     0.5040    -0.4936
           std                   0.1280     0.1546     0.1567     0.1221     0.1000     0.1075
           $\widehat{std}$       0.1317     0.1588     0.1602     0.1264     0.1011     0.1084
           95%                   0.9480     0.9380     0.9380     0.9440     0.9480     0.9510
Local      ave                   1.3052    -1.2629     0.9687    -0.4988     0.5023    -0.4897
           std                   0.1478     0.1736     0.1715     0.1393     0.1069     0.1153
dMAVE      ave                   1.2599    -1.2933     1.0014    -0.4763     0.4984    -0.4935
           std                   0.1932     0.1427     0.1550     0.1701     0.1368     0.1378
SIR        ave                   1.3881    -1.1930     0.9261    -0.5968     0.4793    -0.4724
           std                   0.1696     0.1522     0.1414     0.1489     0.0976     0.0995
DR         ave                   0.9935    -0.2217     0.1930    -0.6863     0.1245    -0.1071
           std                   0.6567     1.2305     1.0107     0.6411     0.3069     0.2999

Local: the locally efficient estimate, which mis-specifies the model $\eta_2$ and 
estimates $E(\cdot \mid \beta^{\mathrm T}\mathbf{x})$ through nonparametric regression. This is an 
implementation of (3.1). 

dMAVE: the conditional density based minimum average variance estimation 
proposed by [26]. 

SIR: the sliced inverse regression [12], which estimates $\beta$ as the first d 
principal eigenvectors of $\Sigma^{-1}\mathrm{cov}\{E(\mathbf{x} \mid Y)\}\Sigma^{-1}$, where $\Sigma = \mathrm{cov}(\mathbf{x})$. 

DR: the directional regression [10], which estimates $\beta$ as the first d 
principal eigenvectors of the kernel matrix $\Sigma^{-1/2}E\{2I_p - A(Y,\widetilde Y)\}^2\Sigma^{-1/2}$, 
where $A(Y,\widetilde Y) = \Sigma^{-1/2}E\{(\mathbf{x} - \widetilde{\mathbf{x}})(\mathbf{x} - \widetilde{\mathbf{x}})^{\mathrm T} \mid Y,\widetilde Y\}\Sigma^{-1/2}$ and $(\widetilde{\mathbf{x}},\widetilde Y)$ is an 
independent copy of $(\mathbf{x},Y)$. 
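
As an illustration of the SIR benchmark described above, the following is a minimal sketch rather than the implementation used in the simulations. It takes the kernel matrix to be $\Sigma^{-1}\mathrm{cov}\{E(\mathbf{x} \mid Y)\}\Sigma^{-1}$ and approximates $E(\mathbf{x} \mid Y)$ by slice means. The function name `sir_directions`, the number of slices and the equal-count slicing rule are assumptions made for illustration only.

```python
import numpy as np

def sir_directions(x, y, d=1, n_slices=10):
    """Sketch of sliced inverse regression (hypothetical helper).

    x: (n, p) covariates; y: (n,) response. Returns a (p, d) matrix whose
    columns are the d leading eigenvectors of Sigma^{-1} M Sigma^{-1},
    where M is a slice-based estimate of cov{E(x | Y)}.
    """
    n, p = x.shape
    xc = x - x.mean(axis=0)                      # center the covariates
    sigma_inv = np.linalg.inv(np.cov(xc, rowvar=False))
    m = np.zeros((p, p))
    # Slice the sample by the order of y (assumed: equal-count slices).
    for idx in np.array_split(np.argsort(y), n_slices):
        slice_mean = xc[idx].mean(axis=0)        # estimate of E(x | Y in slice)
        m += (len(idx) / n) * np.outer(slice_mean, slice_mean)
    kernel = sigma_inv @ m @ sigma_inv
    kernel = (kernel + kernel.T) / 2             # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(kernel)
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]
```

The returned columns are determined only up to sign and scale, so any comparison with a true $\beta$ should be made at the level of the spanned subspace, for example through the angle between the two directions.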

We repeat each experiment 1000 times with sample size n = 500. The 
results are summarized in Table 1 for example 1, Table 2 for example 2 
and Table 3 for example 3. Because the estimators we propose here use a 
different parameterization of the central subspace $\mathcal{S}_{Y|\mathbf{x}}$ from the existing 
methods such as SIR, DR or dMAVE, we transform the results from all the 
estimation procedures to the original $\beta$ used to generate the data for a fair 
and intuitive comparison. 

Table 2 
The average ("ave") and the sample standard errors ("std") for various estimates, and 
the inference results, respectively, the average of the estimated standard deviation 
("$\widehat{std}$") and the coverage of the estimated 95% confidence interval ("95%"), 
of the oracle estimator and the efficient estimator, of $\beta$ in simulated example 2 

Estimator  Statistic          $\beta_1$  $\beta_2$  $\beta_3$  $\beta_4$  $\beta_5$  $\beta_6$
True                                1.3       -1.3          1       -0.5        0.5       -0.5
Oracle     ave                   1.2999    -1.3001     1.0001    -0.4999     0.5002    -0.4999
           std                   0.0023     0.0025     0.0028     0.0022     0.0023     0.0024
           $\widehat{std}$       0.0021     0.0020     0.0026     0.0020     0.0021     0.0023
           95%                   0.9260     0.9070     0.9270     0.9220     0.9210     0.9380
Eff        ave                   1.2996    -1.2999     0.9998    -0.4996     0.5002    -0.5000
           std                   0.0116     0.0116     0.0117     0.0111     0.0068     0.0079
           $\widehat{std}$       0.0123     0.0124     0.0124     0.0120     0.0075     0.0081
           95%                   0.9480     0.9550     0.9570     0.9450     0.9630     0.9520
Local      ave                   1.2992    -1.3010     1.0007    -0.4993     0.5011    -0.5001
           std                   0.0155     0.0210     0.0209     0.0140     0.0142     0.0147
dMAVE      ave                   1.2405    -1.3422     1.0303    -0.4490     0.5114    -0.5134
           std                   0.0229     0.0151     0.0133     0.0153     0.0081     0.0082
SIR        ave                   0.3064    -1.6387     1.2390     0.2477     0.4697    -0.4743
           std                   0.1248     0.3965     0.3149     0.1057     0.1135     0.1141
DR         ave                   0.3424     0.8686    -0.6620    -0.6895    -0.1923     0.1912
           std                   0.2550     1.2518     0.9653     0.6938     0.3360     0.3410

From the results in Table 1, we can see that Oracle, Eff, Local and dMAVE 
provide estimators with small bias, while SIR and DR have substantial bias 
in some of the elements of $\beta$. For example, the average of the second 
estimated component of $\beta$ obtained by DR is -0.2217, in contrast to the true 
value -1.3. This is because the covariate $\mathbf{x}$ satisfies neither the linearity 
nor the constant variance condition, and hence violates the requirements of 
SIR and DR. Although Local and dMAVE both appear consistent, they have 
much larger variability in some components than Eff. For example, in 
estimating $\beta_1$, the sample standard error of dMAVE is 0.1932, whereas that 
of Eff is 0.1280. This is not surprising since Eff is asymptotically 
efficient. In fact, for this very simple setting, the estimation variability 
of Eff is almost as small as that of Oracle, which indicates that the 
asymptotic efficiency is already apparent at n = 500. 

We also report in Table 1 the average of the estimated standard errors, 
computed from the results in Theorem 2, and the coverage of the 95% 
confidence intervals. The estimated standard errors closely match the sample 
standard errors, and the coverage is reasonably close to the nominal level. 
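
The four summaries reported for each coefficient can be computed from the Monte Carlo output along the following lines. This is a hedged sketch: the function name `summarize_replications`, the array layout and the use of a normal-approximation interval (estimate plus or minus 1.96 estimated standard errors) are assumptions made for illustration, not a description of the code behind the tables.

```python
import numpy as np

def summarize_replications(estimates, est_se, truth, z=1.96):
    """Monte Carlo summaries per coefficient (hypothetical helper).

    estimates, est_se: (n_rep, p) arrays of point estimates and of their
    estimated standard errors; truth: length-p vector of true values.
    """
    estimates = np.asarray(estimates, dtype=float)
    est_se = np.asarray(est_se, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # A replication "covers" when the truth lies in estimate +/- z * se
    # (assumed Wald-type interval).
    covered = np.abs(estimates - truth) <= z * est_se
    return {
        "ave": estimates.mean(axis=0),            # average of the estimates
        "std": estimates.std(axis=0, ddof=1),     # sample standard error
        "std_hat": est_se.mean(axis=0),           # average estimated std
        "coverage": covered.mean(axis=0),         # empirical 95% coverage
    }
```

With 1000 replications, the Monte Carlo standard error of an empirical coverage near 0.95 is roughly 0.007, which gives a sense of how far a reported coverage can drift from the nominal level by chance alone.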

Similar phenomena are observed for the simulated example 2 in Table 2, 
where SIR and DR are biased, and Local and dMAVE are consistent but have 
larger variability than Eff and Oracle. In this more complex model, where the 
mean function is highly nonlinear and the error is heteroscedastic, we lose 
the proximity between the oracle performance and the Eff performance. This 
is probably because n = 500 is still too small for this model. The inference 
results in Table 2, however, are still satisfactory, indicating that although 
we cannot achieve the theoretical optimality, inference is still sufficiently 
reliable. 

Table 3 
The average ("ave") and the sample standard errors ("std") for various estimates, and 
the inference results, respectively, the average of the estimated standard deviation 
("$\widehat{std}$") and the coverage of the estimated 95% confidence interval ("95%"), 
of the oracle estimator and the efficient estimator, of $\beta$ in simulated example 3 

Estimator  Statistic          $\beta_{11}$  $\beta_{21}$  $\beta_{31}$  $\beta_{41}$  $\beta_{51}$  $\beta_{61}$
True                                     1        0.6667        0.6667             0       -0.3333        0.6667
Oracle     ave                      1.0009        0.6676        0.6674        0.0002       -0.3339        0.6675
           std                      0.0305        0.0305        0.0325        0.0099        0.0198        0.0314
           $\widehat{std}$          0.0275        0.0275        0.0295        0.0109        0.0178        0.0276
           95%                      0.9270        0.9270        0.9300        0.9590        0.9200        0.9110
Eff        ave                      1.0097        0.6763        0.6764       -0.0000       -0.3384        0.6752
           std                      0.0714        0.0714        0.0745        0.0162        0.0434        0.0740
           $\widehat{std}$          0.0709        0.0709        0.0734        0.0175        0.0454        0.0702
           95%                      0.9280        0.9280        0.9350        0.9530        0.9460        0.9430
Local      ave                      1.0633        0.7300        0.7372       -0.0072       -0.3701        0.7468
           std                      1.8783        1.8783        2.1273        0.2493        1.0694        2.3913
dMAVE      ave                      0.8884        0.6079       -0.1703        0.2119       -0.2498        0.5065
           std                      0.0748        0.1021        0.0951        0.0569        0.0888        0.1155
SIR        ave                      0.5443        0.3781       -0.3301        0.1816       -0.0944        0.1976
           std                      0.1514        0.1414        0.0863        0.0586        0.1257        0.2022
DR         ave                      0.6332        0.2753       -0.2968        0.0939       -0.2701        0.5422
           std                      0.1813        0.2009        0.1003        0.0739        0.1288        0.1567

Estimator  Statistic          $\beta_{12}$  $\beta_{22}$  $\beta_{32}$  $\beta_{42}$  $\beta_{52}$  $\beta_{62}$
True                                   0.8           0.8          -0.3           0.3             0             0
Oracle     ave                      0.8064        0.8064       -0.2905        0.2969       -0.0047        0.0053
           std                      0.0860        0.0860        0.0902        0.0291        0.0550        0.0854
           $\widehat{std}$          0.0828        0.0828        0.0876        0.0296        0.0547        0.0826
           95%                      0.9410        0.9410        0.9320        0.9450        0.9520        0.9430
Eff        ave                      0.8038        0.8038       -0.3067        0.3105        0.0022       -0.0003
           std                      0.1737        0.1737        0.1993        0.0485        0.1511        0.1895
           $\widehat{std}$          0.1439        0.1439        0.1490        0.0381        0.0973        0.1439
           95%                      0.9230        0.9230        0.9240        0.9410        0.9150        0.9080
Local      ave                      0.7689        0.7689       -0.3066        0.2754       -0.0116       -0.0042
           std                      1.1281        1.1281        1.5767        0.4517        0.2192        0.2516
dMAVE      ave                      0.8282        0.7722       -0.0901        0.2371       -0.0153        0.0354
           std                      0.0379        0.0378        0.1188        0.0731        0.0761        0.0489
SIR        ave                      0.7768        0.6849       -0.4083        0.2908        0.0441       -0.0828
           std                      0.0650        0.0808        0.1098        0.0748        0.1059        0.0831
DR         ave                      0.7004        0.6823       -0.4512        0.1498        0.0013       -0.0151
           std                      0.1063        0.1446        0.1688        0.0880        0.1639        0.0945

What we observe in Table 3, for the simulated example 3, tells a 
completely different story. For this case with d = 2, both the linearity and 
the constant variance conditions are violated. In addition, $\mathbf{x}$ contains 
categorical variables. dMAVE, SIR and DR all fail to provide good estimators 
in terms of estimation bias. Local and Eff remain consistent, although, as in 
the simulated example 2, we can no longer hope to see the optimality, since 
the estimation standard error is much larger than that of the Oracle estimator. 
Inference results presented in Table 3 still show satisfactory 95% coverage 
values, while the average estimated standard error can deviate from the 
sample standard error. This is caused by numerical instability in a small 
proportion of the simulation repetitions. In fact, if we replace the average 
with the median estimated standard error, the results are closer. 
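
The effect of a few unstable repetitions on the averaged standard error can be seen in a toy calculation. All numbers below (a typical estimated standard error of 0.10, 2% of repetitions returning the value 5.0) are hypothetical and chosen only to illustrate why the median is less sensitive than the mean; they are not taken from the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 1000

# Hypothetical estimated standard errors from well-behaved repetitions.
est_se = 0.10 + 0.005 * rng.standard_normal(n_rep)

# Hypothetical numerical failures: 2% of repetitions return a huge value.
unstable = rng.choice(n_rep, size=20, replace=False)
est_se[unstable] = 5.0

mean_se = est_se.mean()        # pulled upward by the 20 unstable repetitions
median_se = np.median(est_se)  # stays near the typical value 0.10
```

In this toy setting the mean is roughly doubled by the contaminated repetitions while the median barely moves, which mirrors the qualitative behavior described in the text.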

5. An application. We use the proposed efficient estimator to analyze 
a dataset concerning the employees' salaries in the Fifth National Bank of 
Springfield [1]. The aim of the study is to understand how an employee's 
salary is associated with his/her social characteristics. We regard an 
employee's annual salary as the response variable $Y$, and several social 
characteristics as the associated covariates. These covariates are, 
specifically, current job level ($X_1$); number of years working at the bank 
($X_2$); age ($X_3$); number of years working at other banks ($X_4$); gender 
($X_5$); whether the job is computer related ($X_6$). After removing an obvious 
outlier, the dataset contains 207 observations. 

We calculated the Pearson correlation coefficients and found that the 
current job level ($X_1$) has the largest correlation with the annual salary 
($Y$) [corr($X_1$, $Y$) = 0.614]. This suggests that the current job level is 
possibly an important factor, and thus we fix the coefficient of $X_1$ to be 1 
in our subsequent analysis. We applied the SIR, DR, dMAVE and Eff methods to 
estimate the remaining coefficients. In Figure 1 we present the scatter plots 
of $Y$ versus a single linear combination $\widehat\beta^{\mathrm T}\mathbf{x}$, where $\mathbf{x} = (X_1, \ldots, X_6)^{\mathrm T}$ and 
$\widehat\beta$ denotes the estimate obtained from each of the four estimation 
procedures. The scatter plots exhibit similar monotone patterns in that the 
annual salary increases with the value of $\widehat\beta^{\mathrm T}\mathbf{x}$. Except for DR, the data 
clouds of the other three proposals look very compact. To quantify this 
visual difference, we fit a cubic model by regressing $Y$ on 1, $(\widehat\beta^{\mathrm T}\mathbf{x})$, 
$(\widehat\beta^{\mathrm T}\mathbf{x})^2$ and $(\widehat\beta^{\mathrm T}\mathbf{x})^3$. The adjusted $r^2$ values are also reported in 
Figure 1. The $r^2$ value of DR is much smaller than that of the other 
estimators, which suggests worse performance of DR. This is not a surprise 
because DR requires the most stringent conditions on the covariate vector 
$\mathbf{x}$, which are violated here because of the categorical covariates. The $r^2$ 
values of all other estimators, including Eff, are satisfactory, indicating 
that $\mathcal{S}_{Y|\mathbf{x}}$ is possibly one dimensional. We would also like to point out 
that the $r^2$ value also reflects the goodness-of-fit of the cubic model, and 
hence provides only a reference. 

[Figure 1 panels: (ii) DR ($r^2 = 0.673$); (iii) dMAVE ($r^2 = 0.852$); (iv) Eff ($r^2 = 0.841$)] 

Fig. 1. The scatter plot of $Y$ versus $\widehat\beta^{\mathrm T}\mathbf{x}$, with $\widehat\beta$ obtained from SIR, DR, dMAVE 
and Eff, respectively. The fitted cubic regression curves (-) and the adjusted $r^2$ values are 
shown. 
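
The adjusted $r^2$ of the cubic fit used for this comparison can be computed as in the following sketch. The function name `adjusted_r2_cubic` and its arguments are hypothetical; `index` stands for the estimated linear combination of the covariates evaluated at each observation and `y` for the response.

```python
import numpy as np

def adjusted_r2_cubic(index, y):
    """Adjusted r^2 from regressing y on 1, t, t^2, t^3 (hypothetical helper)."""
    t = np.asarray(index, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.vander(t, N=4, increasing=True)   # columns: 1, t, t^2, t^3
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n, k = design.shape                           # k = 4 fitted coefficients
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    # Penalize the residual variance by the number of fitted coefficients.
    return 1.0 - (rss / (n - k)) / (tss / (n - 1))
```

Because the same cubic form is imposed on every estimated index, a low value may reflect either a poor direction estimate or a poor fit of the cubic form itself, which is why the quantity serves only as a reference.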

Table 4 contains the estimated coefficients $\widehat\beta_j$, the standard errors and 
the p-values obtained through Eff. It can be seen that, in addition to the 
current job level ($X_1$), working experience at the current bank ($X_2$), age 
($X_3$) and whether or not the job is computer related ($X_6$) are also 
important factors for salary. While it is not difficult to understand the 
importance of most of these factors, we believe the age effect is probably 
caused by its high correlation with the working experience 
[corr($X_2$, $X_3$) = 0.676]. 

Table 4 
The estimated coefficients and standard errors obtained by Eff 

                   $\beta_2$   $\beta_3$   $\beta_4$   $\beta_5$   $\beta_6$
Eff    coef.           0.477       0.265       0.024       0.050       0.146
       std.            0.021       0.031       0.030       0.037       0.031
       p-value    $<10^{-4}$  $<10^{-4}$       0.427       0.176  $<10^{-4}$
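
A two-sided Wald test based on the asymptotic normality of the estimator is one natural way to obtain p-values of this kind. The sketch below assumes that form, which is not spelled out in this section, and applies it to the rounded coefficients and standard errors reported in Table 4; the function name `wald_p_value` is hypothetical.

```python
import math

def wald_p_value(coef, se):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = abs(coef / se)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

# Rounded (coef., std.) pairs as reported in Table 4.
table4 = {
    "beta_2": (0.477, 0.021),
    "beta_3": (0.265, 0.031),
    "beta_4": (0.024, 0.030),
    "beta_5": (0.050, 0.037),
    "beta_6": (0.146, 0.031),
}
p_values = {name: wald_p_value(c, s) for name, (c, s) in table4.items()}
```

Since the inputs are rounded to three decimals, the resulting p-values can be expected to agree with the tabulated ones only approximately.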

6. Discussion. We have derived both locally efficient and efficient 
estimators which exhaust the entire central subspace without imposing any 
distributional assumptions. We point out here that if the linearity condition 
holds, the efficiency bound does not change. The linearity condition does, 
however, simplify the computation, because we can simply plug 
$E(\mathbf{x} \mid \beta^{\mathrm T}\mathbf{x}) = \beta(\beta^{\mathrm T}\beta)^{-1}\beta^{\mathrm T}\mathbf{x}$ into the estimating equation instead of 
estimating it nonparametrically. The constant variance condition, in 
contrast, does not seem to contribute to the efficiency bound or to the 
computational simplicity. It is therefore a redundant condition in the 
efficient estimation of the central subspace. 
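
Under the linearity condition, the plug-in expression $E(\mathbf{x} \mid \beta^{\mathrm T}\mathbf{x}) = \beta(\beta^{\mathrm T}\beta)^{-1}\beta^{\mathrm T}\mathbf{x}$ is a projection of $\mathbf{x}$ onto the column space of $\beta$ and requires no smoothing. The sketch below evaluates it for a sample. It assumes, as the displayed form suggests but this passage does not restate, that the covariates are centered with identity covariance; for a general covariance $\Sigma$ the standard expression is $\Sigma\beta(\beta^{\mathrm T}\Sigma\beta)^{-1}\beta^{\mathrm T}\mathbf{x}$. The function name `linear_conditional_mean` is hypothetical.

```python
import numpy as np

def linear_conditional_mean(x, beta):
    """Plug-in E(x | beta^T x) under linearity (hypothetical helper).

    x: (n, p) array of centered, standardized covariates (assumed);
    beta: (p, d) or length-p array. Returns an (n, p) array whose i-th row
    is beta (beta^T beta)^{-1} beta^T x_i.
    """
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float).reshape(x.shape[1], -1)
    # p x p projection matrix onto the column space of beta.
    proj = beta @ np.linalg.solve(beta.T @ beta, beta.T)
    return x @ proj  # proj is symmetric, so rows are (proj @ x_i)^T
```

Replacing a kernel regression by this closed form removes a bandwidth choice from the estimation, which is the computational simplification referred to above.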

In this paper we did not discuss how to determine d, the structural 
dimension of $\mathcal{S}_{Y|\mathbf{x}}$, when an efficient estimation procedure is used, although 
we agree that this is an important issue in the area of dimension 
reduction. In the real-data example, we infer the structural dimension through 
the adjusted $r^2$ values. This seems a reasonable choice, but the outcome may 
depend on how the underlying model structure is recovered. A rigorous 
data-driven procedure for determining d is needed in future work. 

Various model extensions have been considered in the dimension reduction 
literature. For example, in partial dimension reduction problems [3], 
it is assumed that $F(Y \mid \mathbf{x}) = F(Y \mid \beta^{\mathrm T}\mathbf{x}_1, \mathbf{x}_2)$. Here, $\mathbf{x}_1$ is a covariate 
sub-vector of $\mathbf{x}$ that the dimension reduction procedure focuses on, while 
$\mathbf{x}_2$ is a covariate sub-vector that is known to directly enter the model 
based on scientific understanding or convention. We can see that the 
semiparametric analysis and the efficient estimation results derived here can 
be adapted to these models by changing $\beta^{\mathrm T}\mathbf{x}$ to $(\beta^{\mathrm T}\mathbf{x}_1, \mathbf{x}_2)$ in all the 
corresponding functions and expectations, while everything else remains 
unchanged. Another extension is the group-wise dimension reduction [14], 
where the model 
i?(y|x) = X^j=i ?n'i(^, x^/3j) is considered. The semiparametric analysis in 
such models requires separate investigation, and it will be interesting to 
study the efficient estimation. 

SUPPLEMENTARY MATERIAL 

Supplement to "Efficient estimation in sufficient dimension reduction" 
(DOI: 10.1214/12-AOS1072SUPP; .pdf). The supplement file aos1072_supp.pdf 
is available upon request. It contains derivations of the efficient score 
for model (2.1) and an outline of proof for Theorems 1 and 2. 